Focused Quantization for Sparse CNNs

Neural Information Processing Systems

Deep convolutional neural networks (CNNs) are powerful tools for a wide range of vision tasks, but the enormous amount of memory and compute resources required by CNNs poses a challenge in deploying them on constrained devices. Existing compression techniques, while excelling at reducing model sizes, struggle to be computationally friendly. In this paper, we attend to the statistical properties of sparse CNNs and present focused quantization, a novel quantization strategy based on power-of-two values, which exploits the weight distributions after fine-grained pruning. The proposed method dynamically discovers the most effective numerical representation for weights in layers with varying sparsities, significantly reducing model sizes. Multiplications in quantized CNNs are replaced with much cheaper bit-shift operations for efficient inference. Coupled with lossless encoding, we build a compression pipeline that provides CNNs with high compression ratios (CR), low computation cost, and minimal loss in accuracy. On ResNet-50, we achieved an 18.08x CR with only 0.24% loss in top-5 accuracy, outperforming existing compression methods. We fully compressed a ResNet-18 and found that it is not only higher in CR and top-5 accuracy, but also more hardware-efficient, requiring fewer logic gates to implement than other state-of-the-art quantization methods at the same throughput.
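The core trick behind replacing multiplications with bit shifts can be sketched in a few lines. This is a minimal illustration of generic power-of-two quantization, not the paper's exact focused-quantization scheme: each nonzero weight is rounded to the nearest signed power of two (pruned zeros stay zero), so multiplying an integer activation by a weight becomes a shift by the weight's exponent.

```python
import numpy as np

def quantize_pow2(w, n_bits=4):
    """Round each nonzero weight to the nearest signed power of two.
    Illustrative sketch only; n_bits bounds the exponent range."""
    sign = np.sign(w)
    # nearest exponent in the log2 domain, clipped to a representable range
    exp = np.clip(np.round(np.log2(np.abs(w) + 1e-12)),
                  -(2 ** (n_bits - 1)), 0)
    q = sign * (2.0 ** exp)
    q[w == 0] = 0.0  # preserve pruned (sparse) weights exactly
    return q, exp.astype(int)

def shift_mul(x_int, exp):
    """x_int * 2**exp using only bit shifts (exp <= 0 means right shift)."""
    return x_int << exp if exp >= 0 else x_int >> (-exp)
```

With such a representation, the inner product of an integer activation vector with a quantized weight vector needs no multiplier at all, which is what makes the hardware cost (logic gates per MAC) drop.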


Breaking the Illusion: Consensus-Based Generative Mitigation of Adversarial Illusions in Multi-Modal Embeddings

Akbarian, Fatemeh, Baninajjar, Anahita, Zhang, Yingyi, Balashankar, Ananth, Aminifar, Amir

arXiv.org Artificial Intelligence

Multi-modal foundation models align images, text, and other modalities in a shared embedding space but remain vulnerable to adversarial illusions [35], where imperceptible perturbations disrupt cross-modal alignment and mislead downstream tasks. To counteract the effects of adversarial illusions, we propose a task-agnostic mitigation mechanism that reconstructs the input from the attacker's perturbed input through generative models, e.g., Variational Autoencoders (VAEs), to maintain natural alignment. To further enhance our proposed defense mechanism, we adopt a generative sampling strategy combined with a consensus-based aggregation scheme over the outcomes of the generated samples. Our experiments on state-of-the-art multi-modal encoders show that our approach substantially reduces the illusion attack success rates to near-zero and improves cross-modal alignment by 4% (42 → 46) and 11% (32 → 43) in the unperturbed and perturbed input settings, respectively, providing an effective and model-agnostic defense against adversarial illusions. Multi-modal foundation models have rapidly advanced the frontier of visual and linguistic understanding. Foundation models such as CLIP [19], ALIGN [11], and ImageBind [8] align a variety of heterogeneous modalities, including images and text, within a shared embedding space, thereby enabling zero-shot classification, cross-modal retrieval, and generative conditioning. The shared embedding space that underpins cross-modal flexibility simultaneously introduces a new attack surface, giving rise to adversarial illusions [35]. As downstream tasks directly rely on the integrity of this shared representation, even small perturbations in one modality can induce semantic misalignment across others, misleading models that depend on the embedding for retrieval, captioning, or generative conditioning. Defending against such cross-modal attacks presents unique challenges.
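The consensus step can be illustrated with a small sketch. This is our assumption of the general recipe, not the authors' exact aggregation rule: each of several reconstructions sampled from the generative model is embedded, each embedding votes for its most similar candidate label, and the majority vote becomes the defended prediction.

```python
import numpy as np

def consensus_vote(sample_embs, candidate_embs):
    """Each row of sample_embs (one per generated reconstruction) votes for
    its most cosine-similar candidate; the majority candidate wins.
    Illustrative aggregation sketch, not the paper's exact scheme."""
    s = sample_embs / np.linalg.norm(sample_embs, axis=1, keepdims=True)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    votes = np.argmax(s @ c.T, axis=1)  # nearest candidate per sample
    return int(np.bincount(votes, minlength=len(candidate_embs)).argmax())
```

The intuition is that an adversarial perturbation that fools one reconstruction is unlikely to survive, in the same direction, across many independent generative samples, so the majority recovers the natural alignment.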



H-DDx: A Hierarchical Evaluation Framework for Differential Diagnosis

Lim, Seungseop, Kim, Gibaeg, Lee, Hyunkyung, Han, Wooseok, Seo, Jean, Yoo, Jaehyo, Yang, Eunho

arXiv.org Artificial Intelligence

An accurate differential diagnosis (DDx) is essential for patient care, shaping therapeutic decisions and influencing outcomes. Recently, Large Language Models (LLMs) have emerged as promising tools to support this process by generating a DDx list from patient narratives. However, existing evaluations of LLMs in this domain primarily rely on flat metrics, such as Top-k accuracy, which fail to distinguish between clinically relevant near-misses and diagnostically distant errors. To mitigate this limitation, we introduce H-DDx, a hierarchical evaluation framework that better reflects clinical relevance. H-DDx leverages a retrieval and reranking pipeline to map free-text diagnoses to ICD-10 codes and applies a hierarchical metric that credits predictions closely related to the ground-truth diagnosis. In benchmarking 22 leading models, we show that conventional flat metrics underestimate performance by overlooking clinically meaningful outputs, with our results highlighting the strengths of domain-specialized open-source models. Furthermore, our framework enhances interpretability by revealing hierarchical error patterns, demonstrating that LLMs often correctly identify the broader clinical context even when the precise diagnosis is missed.
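A hierarchical metric of this flavor can be sketched by exploiting the prefix structure of ICD-10 codes (chapter, category, subcategory). The scoring rule below, giving partial credit proportional to the shared code prefix, is an illustrative assumption on our part, not H-DDx's actual metric:

```python
def hierarchical_credit(pred: str, truth: str) -> float:
    """Partial credit for a predicted ICD-10 code: the fraction of code
    characters shared with the ground truth from the left, so near-misses
    within the same category score higher than distant errors.
    Illustrative sketch, not the H-DDx metric."""
    p, t = pred.replace(".", ""), truth.replace(".", "")
    match = 0
    for a, b in zip(p, t):
        if a != b:
            break
        match += 1
    return match / max(len(t), 1)
```

Under such a rule, predicting J18.1 (lobar pneumonia) against a ground truth of J18.9 (pneumonia, unspecified) scores 0.75 rather than the 0 a flat Top-k metric would assign, which is exactly the clinically-relevant-near-miss distinction the abstract describes.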


Contrastive timbre representations for musical instrument and synthesizer retrieval

Vaillant, Gwendal Le, Molle, Yannick

arXiv.org Artificial Intelligence

Efficiently retrieving specific instrument timbres from audio mixtures remains a challenge in digital music production. This paper introduces a contrastive learning framework for musical instrument retrieval, enabling direct querying of instrument databases using a single model for both single- and multi-instrument sounds. We propose techniques to generate realistic positive/negative pairs of sounds for virtual musical instruments, such as samplers and synthesizers, addressing limitations in common audio data augmentation methods. The first experiment focuses on instrument retrieval from a dataset of 3,884 instruments, using single-instrument audio as input. Contrastive approaches are competitive with previous works based on classification pre-training. The second experiment considers multi-instrument retrieval with a mixture of instruments as audio input. In this case, the proposed contrastive framework outperforms related works, achieving 81.7% top-1 and 95.7% top-5 accuracies for three-instrument mixtures.
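Contrastive frameworks of this kind typically train with an InfoNCE-style objective over batches of positive/negative pairs. The sketch below is our assumption of that general recipe, not the authors' exact loss: row i of the positives is the positive for anchor i, and all other rows in the batch act as negatives.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Minimal InfoNCE loss sketch over paired embeddings: the diagonal of
    the anchor-positive similarity matrix holds the matching pairs, and
    each row is softmax-normalized against in-batch negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                   # pairwise cosine / tau
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))       # diagonal = true pairs
```

The paper's contribution sits in how the positive/negative pairs are generated for samplers and synthesizers; the loss itself only sees the resulting embedding pairs, which is why one model can serve both single- and multi-instrument queries.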


End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning

Zheng, Qiaoyu, Sun, Yuze, Wu, Chaoyi, Zhao, Weike, Qiu, Pengcheng, Yu, Yongguo, Sun, Kun, Wang, Yanfeng, Zhang, Ya, Xie, Weidi

arXiv.org Artificial Intelligence

Accurate diagnosis with medical large language models is hindered by knowledge gaps and hallucinations. Retrieval and tool-augmented methods help, but their impact is limited by weak use of external knowledge and poor feedback-reasoning traceability. To address these challenges, we introduce Deep-DxSearch, an agentic RAG system trained end-to-end with reinforcement learning (RL) that enables steerable, traceable retrieval-augmented reasoning for medical diagnosis. In Deep-DxSearch, we first construct a large-scale medical retrieval corpus comprising patient records and reliable medical knowledge sources to support retrieval-aware reasoning across diagnostic scenarios. More crucially, we frame the LLM as the core agent and the retrieval corpus as its environment, using tailored rewards on format, retrieval, reasoning structure, and diagnostic accuracy, thereby evolving the agentic RAG policy from large-scale data through RL. Experiments demonstrate that our end-to-end agentic RL training framework consistently outperforms prompt-engineering and training-free RAG approaches across multiple data centers. After training, Deep-DxSearch achieves substantial gains in diagnostic accuracy, surpassing strong diagnostic baselines such as GPT-4o, DeepSeek-R1, and other medical-specific frameworks for both common and rare disease diagnosis under in-distribution and out-of-distribution settings. Moreover, ablation studies on reward design and retrieval corpus components confirm their critical roles, underscoring the uniqueness and effectiveness of our approach compared with traditional implementations. Finally, case studies and interpretability analyses highlight improvements in Deep-DxSearch's diagnostic policy, providing deeper insight into its performance gains and supporting clinicians in delivering more reliable and precise preliminary diagnoses. See https://github.com/MAGIC-AI4Med/Deep-DxSearch.
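A composite reward over format, retrieval, and diagnostic-accuracy signals can be sketched as a weighted sum. Both the weights and the binary signal names below are our assumptions for illustration, not values from the paper:

```python
def composite_reward(valid_format: float,
                     retrieved_relevant: float,
                     correct_dx: float,
                     w_fmt: float = 0.1,
                     w_ret: float = 0.3,
                     w_acc: float = 0.6) -> float:
    """Illustrative weighted RL reward combining three signals in [0, 1]:
    well-formed output, relevant retrieval, and correct final diagnosis.
    Weights are hypothetical, not the paper's tuned values."""
    return w_fmt * valid_format + w_ret * retrieved_relevant + w_acc * correct_dx
```

Weighting diagnostic accuracy most heavily while still rewarding format and retrieval is one way such a shaped reward can steer the policy toward traceable reasoning rather than unsupported final answers.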



Appendix A: Dynamic weight sharing

Neural Information Processing Systems

A.1 Noiseless case: each neuron receives the same k-dimensional input x; the weight-update dynamics follow Eq. (8b). In each iteration, the input is presented for 150 ms; realistically, all neurons cannot see the same input. Bounding the input mean and noise, the full gradient is bounded by the sum of the individual bounds. Both plots in Figure 1 show the mean negative log SNR over 10 runs, with 100 output neurons each. Learning was performed via SGD with a momentum of 0.95; the minimum SNR value was computed from Eq. (5). The code for both runs is provided in the supplementary material.